streaming data
Learning Mixture of Gaussians with Streaming Data
In this paper, we study the problem of learning a mixture of Gaussians with streaming data: given a stream of $N$ points in $d$ dimensions generated by an unknown mixture of $k$ spherical Gaussians, the goal is to estimate the model parameters using a single pass over the data stream. We analyze a streaming version of the popular Lloyd's heuristic and show that the algorithm estimates all the unknown centers of the component Gaussians accurately if they are sufficiently separated. Assuming each pair of centers are $C\sigma$ distant with $C=\Omega((k\log k)^{1/4}\sigma)$ and where $\sigma^2$ is the maximum variance of any Gaussian component, we show that asymptotically the algorithm estimates the centers optimally (up to certain constants); our center separation requirement matches the best known result for spherical Gaussians \citep{vempalawang}. For finite samples, we show that a bias term based on the initial estimate decreases at $O(1/{\rm poly}(N))$ rate while variance decreases at nearly optimal rate of $\sigma^2 d/N$. Our analysis requires seeding the algorithm with a good initial estimate of the true cluster centers for which we provide an online PCA based clustering algorithm. Indeed, the asymptotic per-step time complexity of our algorithm is the optimal $d\cdot k$ while space complexity of our algorithm is $O(dk\log k)$. In addition to the bias and variance terms which tend to $0$, the hard-thresholding based updates of streaming Lloyd's algorithm is agnostic to the data distribution and hence incurs an \emph{approximation error} that cannot be avoided. However, by using a streaming version of the classical \emph{(soft-thresholding-based)} EM method that exploits the Gaussian distribution explicitly, we show that for a mixture of two Gaussians the true means can be estimated consistently, with estimation error decreasing at nearly optimal rate, and tending to $0$ for $N\rightarrow \infty$.
Variance-Reduced Stochastic Gradient Descent on Streaming Data
We present an algorithm STRSAGA for efficiently maintaining a machine learning model over data points that arrive over time, quickly updating the model as new training data is observed. We present a competitive analysis comparing the sub-optimality of the model maintained by STRSAGA with that of an offline algorithm that is given the entire data beforehand, and analyze the risk-competitiveness of STRSAGA under different arrival patterns. Our theoretical and experimental results show that the risk of STRSAGA is comparable to that of offline algorithms on a variety of input arrival patterns, and its experimental performance is significantly better than prior algorithms suited for streaming data, such as SGD and SSVRG.
Coresets for k-Segmentation of Streaming Data
Life-logging video streams, financial time series, and Twitter tweets are a few examples of high-dimensional signals over practically unbounded time. We consider the problem of computing optimal segmentation of such signals by k-piecewise linear function, using only one pass over the data by maintaining a coreset for the signal. The coreset enables fast further analysis such as automatic summarization and analysis of such signals. A coreset (core-set) is a compact representation of the data seen so far, which approximates the data well for a specific task -- in our case, segmentation of the stream. We show that, perhaps surprisingly, the segmentation problem admits coresets of cardinality only linear in the number of segments k, independently of both the dimension d of the signal, and its number n of points. More precisely, we construct a representation of size O(klog n /eps^2) that provides a (1+eps)-approximation for the sum of squared distances to any given k-piecewise linear function. Moreover, such coresets can be constructed in a parallel streaming approach. Our results rely on a novel eduction of statistical estimations to problems in computational geometry. We empirically evaluate our algorithms on very large synthetic and real data sets from GPS, video and financial domains, using 255 machines in Amazon cloud.
Stochastic Optimization Algorithms for Instrumental Variable Regression with Streaming Data
We develop and analyze algorithms for instrumental variable regression by viewing the problem as a conditional stochastic optimization problem. In the context of least-squares instrumental variable regression, our algorithms neither require matrix inversions nor mini-batches thereby providing a fully online approach for performing instrumental variable regression with streaming data. When the true model is linear, we derive rates of convergence in expectation, that are of order \mathcal{O}(\log T/T) and \mathcal{O}(1/T {1-\epsilon}) for any \epsilon 0, respectively under the availability of two-sample and one-sample oracles respectively. Importantly, under the availability of the two-sample oracle, the aforementioned rate is actually agnostic to the relationship between confounder and the instrumental variable demonstrating the flexibility of the proposed approach in alleviating the need for explicit model assumptions required in recent works based on reformulating the problem as min-max optimization problems. Experimental validation is provided to demonstrate the advantages of the proposed algorithms over classical approaches like the 2SLS method.
Reviews: Learning Mixture of Gaussians with Streaming Data
This paper proposes a LLoyd type method with PCA initialization to estimate means of Gaussians in a streaming setting. References seem to be severely lacking as the domain is wide. The key point of the algorithm seems to be the initialization and there is no discussion on this part (comparison with the literature on k-means initialization). While the technical results might be interesting I have difficulties commenting on them without the proper background on the subject. I find as well that the technical exposition of proofs is not very clear (notations, chaining of arguments).
Reviews: Variance-Reduced Stochastic Gradient Descent on Streaming Data
This paper considers the problem of streaming stochastic optimization taking into account the arrival patterns of examples in time. Whereas the relevant previous work focuses on learning from a stream (i.e. in one pass over the data, with O(dimension) memory), this work attempts to utilize the time spent between the arrival of examples in order to revisit previously encountered examples. The problem description is novel and appealing, both in its possible practical relevance and as an interesting new model for consideration in algorithm analysis. However, there are central issues with the comparisons that the paper draws between algorithms, both conceptually and experimentally. Conceptually, it seems incorrect to say that STRSAGA is a streaming algorithm, and in turn in to repeatedly compare it to one (such as SSVRG), since STRSAGA's memory complexity actually grows linearly with the dataset size (in order to maintain its "effective sample set").
Learning from Streaming Data when Users Choose
Moreover, due to the data-driven nature of digital platforms, interesting dynamics emerge among users and service In digital markets comprised of many competing providers: on the one hand, users choose amongst services, each user chooses between multiple providers based on the quality of their services; on the other service providers according to their preferences, hand, providers use the user data to improve and update and the chosen service makes use of the user data their services, affecting future user choices (Ginart et al., to incrementally improve its model. The service 2021; Kwon et al., 2022; Dean et al., 2024; Jagadeesan et al., providers' models influence which service the 2023a). For example, in personalized music streaming platform, user will choose at the next time step, and the a user chooses amongst different music streaming user's choice, in return, influences the model update, platforms based on how well they meet the user's needs.
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > Jordan (0.04)
- North America > United States > California (0.04)
- (2 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
T-SaS: Toward Shift-aware Dynamic Adaptation for Streaming Data
Ren, Weijieying, Zhao, Tianxiang, Qin, Wei, Liu, Kunpeng
In many real-world scenarios, distribution shifts exist in the streaming data across time steps. Many complex sequential data can be effectively divided into distinct regimes that exhibit persistent dynamics. Discovering the shifted behaviors and the evolving patterns underlying the streaming data are important to understand the dynamic system. Existing methods typically train one robust model to work for the evolving data of distinct distributions or sequentially adapt the model utilizing explicitly given regime boundaries. However, there are two challenges: (1) shifts in data streams could happen drastically and abruptly without precursors. Boundaries of distribution shifts are usually unavailable, and (2) training a shared model for all domains could fail to capture varying patterns. This paper aims to solve the problem of sequential data modeling in the presence of sudden distribution shifts that occur without any precursors. Specifically, we design a Bayesian framework, dubbed as T-SaS, with a discrete distribution-modeling variable to capture abrupt shifts of data. Then, we design a model that enable adaptation with dynamic network selection conditioned on that discrete variable. The proposed method learns specific model parameters for each distribution by learning which neurons should be activated in the full network. A dynamic masking strategy is adopted here to support inter-distribution transfer through the overlapping of a set of sparse networks. Extensive experiments show that our proposed method is superior in both accurately detecting shift boundaries to get segments of varying distributions and effectively adapting to downstream forecast or classification tasks.
Autoencoder-based Anomaly Detection in Streaming Data with Incremental Learning and Concept Drift Adaptation
Li, Jin, Malialis, Kleanthis, Polycarpou, Marios M.
In our digital universe nowadays, enormous amount of data are produced in a streaming manner in a variety of application areas. These data are often unlabelled. In this case, identifying infrequent events, such as anomalies, poses a great challenge. This problem becomes even more difficult in non-stationary environments, which can cause deterioration of the predictive performance of a model. To address the above challenges, the paper proposes an autoencoder-based incremental learning method with drift detection (strAEm++DD). Our proposed method strAEm++DD leverages on the advantages of both incremental learning and drift detection. We conduct an experimental study using real-world and synthetic datasets with severe or extreme class imbalance, and provide an empirical analysis of strAEm++DD. We further conduct a comparative study, showing that the proposed method significantly outperforms existing baseline and advanced methods.
How to Learn From Streaming Data with Creme in Python?
In a traditional paradigm of machine learning, we often work in the offline learning fashion where we start with data preprocessing and end with data modelling with an algorithm to satisfy the requirements. This becomes a storage-dependent and time-consuming process. To overcome this, we can use streaming data for predictive analysis or any other modelling process. We don't need to store the data before modelling it. This can be accomplished by stream learning and online learning.